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Abstract 

We describe quantum tomography as an inverse statistical problem and 
show how entropy methods can be used to study the behaviour of sieved 
maximum likelihood estimators. There remain many open problems, and 
a main purpose of the paper is to bring these to the attention of the 
statistical community. 



1 Introduction 

It is curious that it took more than eighty years from its discovery till it was 
possible to experimentally determine and visualize the most fundamental object 
in quantum mechanics, the wave function. The forward route from quantum 
state to probability distribution of measurement results has been the basic stuff 
of quantum mechanics textbooks for decennia. That the corresponding math- 
ematical inverse problem had a solution, provided (speaking metaphorically) 
that the quantum state has been probed from a sufficiently rich set of direc- 
tions, had also been known for many years. However it was only in 1993, with 
|14|. that it became feasible to actually carry out the corresponding measure- 
ments on one particular quantum system — in that case, the state of one mode 
of electromagnetic radiation (a pulse of laser light at a given frequency). The 
resulting pictures have since made it to the front covers of journals like Nature 
and Science, and experimentalists use the technique to establish that they have 
succeeded in creating non-classical forms of laser light such as squeezed light and 
Schrodinger cats. The experimental technique we are referring to here is called 
quantum homodyne tomography: the word homodyne referring to a compari- 
son between the light being measured with a reference light beam at the same 
frequency. We will explain the word tomography in a moment. 

The quantum state can be represented mathematically in many different 
but equivalent ways, all of them linear transformations on one another. One 
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favourite is as the Wigner function W: a real function of two variables, integrat- 
ing to plus one over the whole plane, but not necessarily nonnegative. It can be 
thought of as a "generalised joint probability density" of the electric and mag- 
netic fields, q and p. However one cannot measure both fields at the same time 
and in quantum mechanics it makes no sense to talk about the values of both 
electric and magnetic fields simultaneously. It does, however, make sense to talk 
about the value of any linear combination of the two fields, say cos(0)g + sin(0)p 
(in fact, one can think of <p as simply representing time). And one way to think 
about the statisticial problem is as follows: the unknown parameter is a joint 
probability density W of two variables Q and P. The data consists of indepen- 
dent samples from the distribution of (X, <&) = (cos($)Q + sin($)P, $), where 
3> is chosen independently of (Q, P), and uniformly in the interval [0, n]. Write 
down the mathematical model expressing the joint density of {X, in terms 
of that of (Q,P). Now just allow that latter joint density, W, to take negative 
as well as positive values (subject to certain restrictions which we will mention 
later). And that is the statistical problem of this paper. 

This is indeed a classical tomography problem: we take observations from all 
possible one-dimensional projections of a two-dimensional probability density. 
The non-classical feature is that though all these one-dimensional projections 
are indeed bona-fide probability densities, the underlying two-dimensional "joint 
density" need not itself be a bona-fide joint density, but can have small patches 
of "negative probability density" . 

Though the parameter to be estimated may look weird from some points 
of view (for instance, when one looks at it "as a probability density"), it is 
mathematically very nice from other points of view. For instance, one can also 
represent it by a matrix of (a kind of) Fourier coefficients: one speaks then of 
the "density matrix" p. This is an infinite dimensional matrix of complex num- 
bers, but it is a positive and selfadjoint matrix with trace one. The diagonal 
elements are real numbers summing to one, and forming the probability distri- 
bution of the number of photons found in the light beam (if one could do that 
measurement). Conversely, any such matrix p corresponds to a physically possi- 
ble Wigner function W , so we have here a concise mathematical characterization 
of precisely which "generalized joint probability densities" can occur. 

The initial reconstructions were done by borrowing analytic techniques from 
classical tomography — the data was binned and smoothed, the inverse Radon 
transform carried out, followed by some Fourier transformations. At each of a 
number of steps, there are numerical discretization and truncation errors. The 
histogram of the data will not lie in the range of the forward transformation 
(from quantum state to density of the data). Thus the result of blindly ap- 
plying an inverse will not be a bona-fide Wigner function or density matrix. 
Moreover the various numerical approximations all involve arbitrary choices of 
smoothing, binning or truncation parameters. Consequently the final picture 
can look just how the experimenter would like it to look and there is no way to 
statistically evaluate the reliability of the result. On the other hand the various 
numerical approximations tend to destroy the interesting "quantum" features 
the experimenter is looking for, so this method lost in popularity after the initial 
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cnthousiasm. 

So far there has been almost no attention paid to this problem by statisti- 
cians, which is a shame, since on the one hand it is one of the most important 
statistical problems coming up in modern physics, and on the other hand it is 
"just" a classical nonparametric statistical inverse problem. The unknown pa- 
rameter is some object p, or if you prefer W, lying in an infinite dimensional lin- 
ear space (the space of density matrices, or the space of Wigner functions; these 
are just two concrete representations in which the experimenter has particular 
interest). The data has a probability distribution which is a linear transform 
of the parameter. Considered as an analytical problem, we have an ill-posed 
inverse problem, but one which has a lot of beautiful mathematical structure 
and about which a lot is known (for instance, close connection to the Radon 
transform). Moreover it has features in common with nonparametric missing 
data problems (the projections from bivariate to univariate, for instance, and 
there are more connections we will mention later) and with nonparametric den- 
sity and regression estimation. Thus we think that the time is ripe for this 
problem to be "cracked" by mathematical and computational statisticians. In 
this paper we will present some first steps in that direction. 

Our main results will be consistency theorems for two estimators. Both es- 
timators are based on approximating the infinite dimensional parameter p by 
a finite dimensional parameter, in fact, thinking of p as an infinite dimensional 
matrix, we simply truncate it to an N x N matrix where the truncation level 
N will be allowed to grow with the number of observations n. The first esti- 
mator employs some analytical inverse formulas expressing the elements of p 
as mean values of certain functions of the observations (X, $). Simply replace 
the theoretical means by empirical averages and one has unbiased estimators 
of the elements of p, with moreover finite variance. If one applies this tech- 
nique without truncation the estimate of the matrix p as a whole will typically 
not satisfy the nonnegativity constraints. The resulting estimator will not be 
consistent either, with respect to natural distance measures. But provided the 
truncation level grows with n slowly enough, the truncated estimate will satisfy 
the constraints, and provided it grows fast enough, the overal estimator will be 
consistent. 

There are many unbiased estimators of the matrix elements of p and the 
choice we make is based on analytic tractability, not on any optimality criteria. 

The second estimator we study further exploits the same idea, in a more 
canonical way: we study the sieved maximum likelihood estimator based on the 
same truncation to a finite dimensional problem. The truncation level N needs 
to depend on sample size n to balance bias and variance. We prove consistency 
of the sieved mle under an appropriate choice of N(n) by applying a general 
theorem of |18| . In order to verify the conditions we need to bound certain metric 
entropy integrals (with bracketing) which express the size (infinite-dimensional- 
ness) of the statistical model under study. 

This turns out to be feasible, and indeed to have an elegant solution, by 
exploiting features of the mapping from parameters (density matrices) to distri- 
butions of the data. Various distances between probability distributions possess 
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analogues as distances between density matrices, the mapping from parameter 
to data turns out to be a contraction, so we can bound metric entropies for the 
statistical model for the data with quantum metric entropies for the class of 
density matrices. And the latter can be calculated quite conveniently. 

Our results form just a first attempt at studying the statistical properties of 
estimators which are already being used by experimental physicists, but they 
show that the basic problem is both rich in interesting features and tractable to 
analysis. The main result so far is a consistency theorem for a sieved maximum 
likelihood estimator, which depends on an assumption of the rate at which a 
truncated density matrix approximates the true one. It seems that the assump- 
tion is satisfied for the kinds of states which are met with in practice. However, 
further work is needed here to describe in physically interpretable terms, when 
the estimator works. Secondly, we need to obtain rates of consistency and to 
further optimize the construction of the estimator. Thirdly, one should explore 
the properties of penalized maximum likelihood, and if possible make it adap- 
tive to the rate of approximation of the truncated model, so that the truncation 
level N(n) is determined from the data. 

We largely restrict attention to an ideal case of the problem where there is 
no further noise in the measurements. In practice, the observations have added 
to them Gaussian disturbances of known variance. There are some indications 
that when the variance is larger than a threshold of 1/2, reconstruction becomes 
impossible or at least, qualitatively much more difficult. This needs to be re- 
searched from the point of view of optimal rates of convergence. The threshold 
should not be an absolute barrier for sieved or penalized maximum likelihood, 
though it may well have qualitative impact on how well this works. 

We also only considered one particular though quite convenient way of siev- 
ing the model, i.e., one particular class of finite dimensional approximations. 
There are many other possibilities and some of them might allow easier analysis 
and easier computation. For instance, instead of truncating the matrix p in a 
given basis, one could truncate in an arbitrary basis, so that the finite dimen- 
sional approximations would corespond to specifying N arbitrary state vectors 
(eigenvectors) and a probability distribution over these "pure states" . Now the 
problem has become a missing data problem, where the "full data" would assign 
to each observation also the label of the pure state from which it came. In the 
full data problem we need to reconstruct not a matrix but a set of vectors, to- 
gether with an ordinary probability distribution over the set, so the "full data" 
problem is statisticially speaking a much easier problem that the missing data 
problem. One could imagine that the EM algorithm, or Bayesian reconstruction 
methods, could exploit this structure. 

We concentrated on estimation of p but it would also be interesting to obtain 
results on estimation of W . The analogy with density estimation could suggest 
new statistical approaches here. Finally, it is most important to add to the 
estimated parameter, estimates of its accuracy. This is absolutely vital for 
applications, but so far no valid approach is available. 

The quantum mathematical physics of this problem is identical to that of 
the quantum simple harmonic oscillator, where q and p stand for position and 
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momentum of a particle, oscillating inside a quadratic potential well. In the 
next section we describe this mathematics using the terminology of position 
and momentum. 

Section 3 is devoted to the ad hoc estimator based on truncation of p, and 
Section 4 to the sieved maximum likelihood estimator. That section finishes 
with some concluding remarks to the whole paper. 



2 Quantum systems and measurements 

In classical mechanics the state of macroscopic systems like billiard balls, pen- 
dulums or stellar systems is described by its "coordinates" in a phase space, 
each coordinate corresponding to an attribute which we can measure such as 
position and momentum. Therefore the functions on the phase space are called 
observables. When there exists uncertainty about the exact point in the phase 
space, or we deal with a statistical ensemble, the state is a probability distri- 
bution and the observables become random variables. Quantum mechanics also 
deals with observables such as position and momentum of a particle, spin of an 
electron, number of photons in a cavity but breaks from classical mechanics in 
that these are no longer functions but selfadjoint operators on a complex Hilbert 
space TL with inner product (•, •) which is linear in the right slot and anti-linear 
in the left one. For example, the components in different directions of the spin 
of an electron are certain selfadjoint operators on C 2 , or hermitian 2x2 matrices 
which do not commute with each other. Another quantum system with which 
we will deal in this paper is the quantum particle. Its basic observables position 
and momentum, are two unbounded selfadjoint operators Q and P respectively 
acting on the complex Hilbert space TL — L 2 (R) as 

(Qipi)(x) = xipi(x), 
.dtp 2 (x) 



(PV> 2 )(z) 



dx 



for Vi>^2, vectors in their respective domains. The operators satisfy Heisen- 
berg's canonical commutation relations QP — PQ = il. We note that the 
algebra generated by all bounded functions of Q and P is dense in the space of 
bounded operators B(TL) with respect to the weak operator topology, defined 
by the seminorms | (il>, -ip) | for all ip £ TL. For this reason B(TL) is usually 
considered the algebra of observables of the system. 

The state of the quantum system is given by a density matrix p, i.e. a 
positive trace-class operator with Tr(p) = 1. This is analogue to the probability 
distribution on the phase space, the expectation of an observable X G B{TL) 
being given by E p (X) := Tr(pX). The states form a convex subset S(TL) of the 
space of trace-class operators on TL, the latter being denoted T\{TC). The lower 
script 1 refers to the norm 

IMIi:=Tr(|r|), (2.1) 
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with respect to which Ti(7i) is a Banach space. If r is selfadjoint then it can 
be represented as an infinite diagonal matrix (in a certain basis of the space H.) 
with elements Tj, thus ||r||i : J2i \ T i\- Any state can be written, in general non- 
uniquely, as convex combination of pure or vector states which have expectations 
of the form 

E^(X) := Tr(P^X) = (V,X^) (2.2) 

where P,/, is the orthogonal projection on the space Cip. There exists a duality 
relation 

7i(W)* =B(H) 

which is the non-commutative analogue of l\ — l^. 

But how do we actually measure an observable of the system? This is in 
general a difficult question from the practical point of view, as we will see in this 
paper only certain observables can be measured with the present technology. 
But we can describe how the probability distribution of the results will look 
if we perform the measurement. Any selfadjoint operators X has a spectral 
decomposition or "diagonalization" 

X = S ' xi"Pi 

i€<r(X) 

where the sum is taken over the spectrum of X and Pj is the projection as- 
sociated to the eigenvalue a;*. The sum should be replaced by an integral for 
operators with continuous spectrum. If the system is in the state p then the 
probability of obtaining the value Xi is 

Vp [i) = Tr(pPi). (2.3) 

which depends only on the spectral projections, the eigenvalues Xi being just la- 
bels of the results. More realistic measurements are modeled by positive operator 
valued measures (POVM) which are maps M from the c-algebra of a measure 
space (0 M , Em) into B{H) with the following properties: M(A) = M(A)* > 
for any A G Em, M(UjAj) = J2i M(Aj) for a countable number of arbitrary 
disjoint Ai e Em, and M(Am) = 1- Similarly to the projection valued case, 
the probability distribution of the results is 

pW(A)=Tr( P M(A)). 

An important feature of the map p i— > Pp M ^ is that it is contractive in appropri- 
ate norms. The total variation distance between two probability distributions 
on (SJm^m) is defined by 

dt v {Pi,P2) ■= sup 

|F|<1 



F{x)P 1 (dx)- F{x)P 2 {dx) 



(2.4) 



G 



Then 



(p(M) )P (M)) 



, PS = sup 
7 l*1<i 



P(:k)P p (m3 (^) 



F{x)Pj™\dx) 



sup 

I-PI<1 



Tr((p-p') / F(x)M(dx)) 



<\\P-P\\ 



where in the last step we have used the fact that J F(x)'M.(dx) < 1 and then we 
applied the inequality |Tr(rY)| < ||t||i||Y|| for all r trace-class and Y bounded, 
in which the reader recognizes its classical counterpart \ J fg\ < ||/||i||<?||oo- 

Notice that we are merely concerned here with the distribution of the results 
and do not specify the state of the system after the measurement. The "no 
quantum cloning theorem" shows that measurements on a single system cannot 
completely reveal its state, in other words if the state is left unchanged after 
measurement then the results do not give us any information on the state. 



We can now formulate our problem in the following way: we have at our 
disposal a large number of systems identically prepared in an unknown state 
p, on each one of them we can perform a certain measurement, and we want 
to construct an estimator of p based on the measurement results. Suppose for 
simplicity that we make the same measurement M on all particles, then the 
results are i.i.d. random variables Xi,X2, ... on (^m, 5]m) with distribution 
Pp 1 . We will be interested in identifiable models, meaning that the map Tm : 
p i ► p™ is one-to-one. For further details on quantum statistical inference we 
refer to the review [2] and the classical textbook . 



3 Quantum homodyne tomography 

Let us return to the quantum particle described by the observables Q and P 
satisfying the canonical commutation relations QP — PQ = 1. The problem 
of measuring observables other than position and momentum has been elusive 
until ten years ago when pioneering experiments in quantum optics conducted by 
Raymer's group j!4j lead to a powerful measurement technique called homodyne 
tomography. The quantum system to be measured is laser light with a fixed 
frequency whose observables are the field amplitudes satisfying commutation 
relation identical to those which characterize the quantum particle. Their linear 
combinations X^ = cos 0Q + sin <ftP are called quadratures, and homodyne 
tomography is about measuring the quadratures for an arbitrary phase (j) 6 [0, n]. 
The experimental setup consists of an additional laser of high intensity called 
local oscillator (LO), which is combined with the mode of unknown state through 
a fifty-fifty beam splitter, and two photon detectors each one measuring one of 
the emerging beams. Then a rescaled difference of the measurement results 
turns out to have the same probability distribution as that of the quadrature 
X^, in the limit of infinite intensity LO. It can be shown that the probability 
distribution P p (-.<fi) on R has density p p (-,4>) with respect to the Lebesgue 



7 



measure and generating function 



E(e ltJi \(f>) = Tr(pe^). (3.1) 

The phase <p is controlled by the experimenter by adjusting a parameter of the 
local oscillator, and it will be assumed to be chosen randomly uniformly dis- 
tributed over the interval [0, it]. Then the joint probability distribution for the 
pair consisting in measurement results and phases (X, $) has density p p (x,<j>) 
with respect to the measure dx x ^ on R x [0, 7r]. A natural way of represent- 
ing the state p is by writing down its matrix elements pij := (ipi, pipj) in an 
orthonormal basis of the Hilbert space L2 (R) , for example 

where Hi are Hermite polynomials. This basis has a special relevance in quan- 
tum optics, ipi(x) being pure states of exactly i photons. Here is the concrete 
formula for p p in terms of Pj.k'- 

00 

p P (x,<f>)= J2 P] .Mx)^{xy- l( °- k) * ■ (3-3) 

j,fc=0 

An important feature of this homodyne detection scheme is the invertibility 
of the map T : p — * p p (-, •), making it theoretically possible to infer the state 
of the system from the knowledge of the distribution of results. This was not 
possible had we measured only a finite number of quadratures! But what is 
the connection of this method with the more familiar computerized tomography 
used in the hospitals? Well, physicists like to represent the state of a quantum 
system by a certain function on R 2 called the Wigner function W(q,p) which 
is much like a joint probability distribution for P and Q in the sense that its 
marginals are the probability distributions for measuring Q and respectively P. 
Of course the two observables cannot be measured simultaneously so we cannot 
speak of a joint distribution, in fact the Wigner function need not be positive 
but many interesting features of the quantum state can be visualized in this way. 
It turns out that p p (x, 4>) is the Radon transformation of the Wigner function 

/oo 
W{q cos <j) + P sin 0, q sin <j) — p cos 4>)dp. 
-00 

The Radon transformation and its inverse play a distinguished role in comput- 
erized tomography. Here one reconstructs a "shape" , for example the spatial 
distribution of the absorption coefficient for X-ray in a cross-section of the hu- 
man body, by recording the transmitted radiation along an axis perpendicular 
to the beam and repeating this with the apparatus rotated at different angles. 
In our case the Wigner function is the unknown function while the probability 
density p p represents the transmitted angle-dependent signal. The term optical 
homodyne tomography was coined in 1993 |14j when the first Wigner function 
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was reconstructed from experimental data using the homodyne scheme. The 
Fourier transform of the Wigner function has the following expression 

W(u,v) = Tr (pe- iuQ - ivF ) , 

and we note that if Q and P were commuting operators then W(q,p) would 
indeed be the joint probability distribution of outcomes of their measurement. 
Finally, from W(u, v) we can obtain the matrix elements of the state p with 
respect to a fixed orthonormal basis by integrating with certain kernel functions 
|1(J| . In practice this procedure has its drawbacks because it involves "filtering" 
the data as in usual tomography which as argued in \7\ amounts to tampering 
with the state that is, making it more "classical". 

In [7] D'Ariano et al. presented a technique which provides the matrix 
elements without calculating the Wigner function as an intermediary step. The 
method has been further analyzed in The key formula shows that any 

operator r G Ti{TL) can be expressed as a linear superposition of functions of 
the observables X^: 

1 /"°° j^i^i r d< Pm_/_ArX J> \„-ir]t i 



d r | r | / -LTr(Te lr ^)e- tr ^. (3.4) 

4 J-oo Jo T 

which is an application of the general theory of quantum tomography developed 
by D'Ariano and his collaborators [510]. By applying this formula to the state 
p and using (|3.1|) we get 



9=1 dx { —p p (x,(j))K(x~X ctj ), 

J-oo JO I 1 



where K is the generalized function given by 

lv± 

2 x 2 e^o+ (x + ie) 



K(x) = ~-V— = - lira Re j- ■ , ^ . (3.5) 



In order to obtain a mathematically sound expression we take the matrix ele- 
ments on both side 



Pk,j = dx 



r % p (x,tl>)f k j{x)e-^- k ^, (3.6) 





with /fcj bounded functions which in the quantum tomography literature are 
called pattern functions. A first concrete expression using Laguerre polynomials 
was found in [5], and was followed by a more transparent one 

fk.k+d(x, 4>) = e^^-(M^d + k(x)), (3.7) 
dx 

in terms of the basis vectors V'fc and a certain un-normalizable solution of the 
Scrodinger equation 



\£_ 1 2 

+ 2 X 



tp j = u j (p j , (3.8) 
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Equation (|3.6() suggests the unbiased estimator of p based on the first n i.i.d. 
results (Xi,$>i) whose matrix elements are |51 1121 ITT] : 

1=1 

By the strong law of large numbers the individual matrix elements of this es- 
timator converge to the matrix elements of the true parameter p and has the 
advantage that it can be computed in real time. The disadvantages are that 
the matrix p( n ' as a whole need not be positive, normalized or even trace-class, 
and one has no control on the convergence p^ — ► p in any relevant distance 
such as for example || • ||i due to the infinite number of matrix elements and 
ranges of the pattern functions /i,i+<j increasing with i and d. We can avoid 
this problem by choosing p^ to be an effectively finite dimensional selfadjoint 
matrix of dimension N(n) growing with n that is, p\ 7 ] > +d = for i + d > N(ri), 

and Pl^ +d given by (|3.9(1 for i + d < N(n). We apply now Hoeff ding's inequality 
for the matrix elements, 

- 9iM > a) < exp ( ; na \ ) , (3.10) 

\ || Ji,i+d\\oo / 

and let p^ denote the restriction of the true density matrix to N(n) dimensional 
subspace on which p( n ) is non-trivial. We will look at the |j • ^-distance defined 
in general by 

||r-r'||2 ; =Tr(|r-r'| 2 )= £ 1^ - rj, fc | 2 . 

j,fe>0 

Then from (|3.10|) we obtain 

n\\p {n) -P {n) h >a)< iV(n) 2 exp ( '"f ) ■ (3-11) 



Lemma 3.1 The following holds: 

N 

£ WhAlo = 0(N^). (3.12) 

Proof. We refer to the paper JT] for a more detailed analysis of the functions 
ipki fj an d we mention here only some qualitative features. Let e > be fixed. 
The Plancherel-Rotarch formulas |15j give asymptotic formulas for ipk and (pk 
in three regions of M: the "classical region" \x\ < ey2fc + 1 where both have an 
oscillatory behavior and have absolute values bounded by the envelope function 
■\/2/7r(l — a; 2 ) -1 / 4 , the "classically forbidden region" \x\ > e\j2h + 1 in which ifik 
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decays as x k e x / 2 while ifk grows as i k 1 e x / 2 , and the "transition region" 
with width efc -1 / 6 centered around the turning point \/2k + 1 in which 

if> k (x) = 2 1 / 4 fc- 1 / 12 Ai (^fc 1 / 6 ^ - V2fcTT)) (3.13) 

and similarly for ipk with the Airy function pQ Ai replaced by Bi. 

The range of the pattern functions fkj increases slowly with the distance 
to the diagonal j — k, thus the main contribution in l|3.12|l is brought by terms 
which lie away from the diagonal. Let C be a fixed constant, then for the pattern 
function f^j situated in the upper corner j > Ck, the maximum is attained in 
the overlap of the classical region for tpj with the transition region of ipk, and 
can be estimated by using the Plancherel-Rotarch formulas 

II/mIU = o(^)- (3-14) 

We sum now over the upper corner to obtain asymptotic behavior of the sum 

□ 

In particular we have the following necessary condition for the || • || 2 -consistency: 

n- l N{n) 7 ^ -»• 0, as n -> oo. (3.15) 
Theorem 3.2 Let (e n ,N(n)) be such that e„ — ► 0, N(n) — > oo and 



- 2\ogN(n) oo. (3.16) 



Then 

Moreover if 



N(n) 7 / 3 

\\p [n) - P\\l = \\p [n) - p\\l + O r {el). (3.17) 

00 / 2 \ 

E ex p(-^73+ 21 °g iV Hj <°° ( 3 - 18 ) 



then — p\\2 — * almost surely. 

Proof. The first statement follows directly from l|3.11|l and the fact that \\p^ n ' — 
HI 2 = II/ 9 '™' — p\\2 + Wf 1 ^ ~ P III- The almost sure convergence follows from 
the first Borel-Cantelli lemma. 

□ 

The homodync tomography as presented in the beginning of this section 
does not take into account various losses (mode mismatching, failure of detec- 
tors) in the detection process which modify the distribution of results in a real 
measurement compared with the idealized case. Fortunately, an analysis of such 
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losses ^U] shows that they can be quantified by a single efficiency coefficient 
< 77 < 1 and the change in probability distributions amounts replacing Xi by 



X[ := y/rjXi + y/(l -Ti)/2Yi (3.19) 

with Yi a sequence of i.i.d. standard Gaussian independent of all Xj. The 
efficiency-corrected probability density is then 



dc 



V I -1/2 \2 

[x -77 ' y) 

1-77 



etc. (3.20) 



The problem is again the inference of the parameter p from $1), (X2, $2). 
One could follow two routes: use a deconvolution technique for the variable X 
to obtain p p and then apply the previous kernel estimator for p, or find new 
pattern functions fk,j(x]v) such that 



DO 



pk,j = dx — p p (x,(j);ri)fk,j(x;r]). (3.21) 
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Such functions are analyzed in [21 El where it is argued that the the method 
has a fundamental limitation for 77 < 1/2 in which case the pattern functions 
are unbounded, while for 77 > 1/2 numerical calculations show that their range 
grows exponentially fast with both indices j, k. However there exists no proof 
of the conjecture which is implicitly made in the literature that it is impossible 
to estimate p consistently for 77 < 1/2. A third route is to first estimate an 
intermediary state p( meas ) as in the r\ = 1 case, and then to obtain p from 
^(moas) inverting a Bernoulli transformation |1(J) : 

p ( v;r? ) j , ^(mcas) inverse Bernoulli ^ ^ (3 22) 

To understand the (inverse) Bernoulli transformation let us consider first the 
diagonal elements {pk — Pk,k, k = 0, 1..} and {qj — Pjj e3S \ j = 0, 1..} which 
are both probability distributions over N and represent the statistics of the 
number of photon in the two states. Let b k k +v — ( fe ^ p )ry fc (l — if) p be the binomial 
distribution. Then 

oo 

q^Y.b'MPk (3.23) 

k=j 

which is interpreted as result of the "absorption" process by which each photon 
is allowed to pass with probability rj and absorbed with probability 1 — 77. The 
general formula is 

00 1/2 

Pt^ = E [ b 3 +P (v)b k k +P (v)} ft+M+p, (3.24) 

and its inverse is obtained by replacing 77 with 77" 1 \ For 77 < 1/2 the power 
series (1 — i]~ 1 ) k appearing in the inverse transformation diverges, reflecting the 
obstruction for obtaining bounded pattern functions fk,j(x; n). 
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4 Sieve maximum likelihood estimation 



In this section we will develop a maximum likelihood approach to the estimation 
of the state p. Let us remind the reader of the terms of the problem: we are 
given a sequence (Xi, $1), (X 2 , $2) ■ ■ ■ of i.i.d. random variables with values 
in R x [0, 7r] with probability density p p with respect to the Lebesgue measure 
dx x ^ depending on the parameter p £ S(7i). When taking into consideration 
the efficiency 77 < 1 we have replace p p by p p (-, •; rj). We would like to find 

=f> (n) (X 1 , X n ,$ n ), 

such that the || • ||i-consistency holds: 

lim \\p {n) = 0, a.s.. 

n — >oo 

Let p n :— Pp(n) be the corresponding probability density. We denote by 



h(Pi,P 2 ) := 




the Hellinger distance between two probability distributions on (f2, S, p) with 
densities pi,p 2 with respect to p. The following relations are well known 

idtv(i\,ft) < h(P 1 ,P 2 ) < y/dt v (Pi,P 2 ), (4.2) 
and combined with l|2.5[) give in the case of our measurement 

h(P T ,P T >) < V\\r-T'\\i (4.3) 
for arbitrary states r, r' 6 S(H). As a consequence, the Hellinger consistency 

lim h(P n ,P p ) 0, a.s., (4.4) 

is weaker than the || • || i-consistency. 

The maximum likelihood estimator is usually defined as the parameter r 
which maximizes the log-likelihood Xm=i ^°SP-r{X a , & a ). In this case the max- 
imum is not achieved over the whole space and it seems more appropriate to 
restrict the attention to a subspace Q(n) on which the maximum exists, whose 
size grows with the number of data and such that U„>iQ(n) is dense in S(Ti) 
in the norm topology. Such a method is called sieved maximum likelihood and 
we refer to ^JE] for the general theory. The choice of the spaces Q{n) should 
be tailored according to the problem one wants to solve, the class of states 
one is interested in, etc. We will use here the number states sieves for which 
Q(n) consists of density matrices over the subspace spanned by the basis vectors 
ipo, . . . , "0jv(n) defined in (|3.2(l . with N(n) an increasing function of n which will 
be fixed later: 

Q{n) ={t£ Ti (H) : r jt k = for all j > N(n) or k > N(n)} . (4.5) 
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The dimension of the space Q(n) is N(n) 2 . Let 



p(n) .- &rg max Vlog Pr (X a ,$ a ), (4.6) 

a— 1 

and notice that by compactness arguments the maximum always exists. 
We define the convex map 

T : S(H) 3 t i — >p T e Li(M x [0,7r], ete x — ), (4.7) 

7T 

whose image V is the class of probability densities of the form Ij^.HJI . but for the 
moment we lack a more intrinsic characterization of its elements. The image of 
the sieve Q(n) is the convex hull V{n) of densities of the form 



N(n) 

J2 a k e* k Uk(x) 

k=0 



(4.8) 



with ip — X)fc!Io' ) ^kipk a um t vector. 

In order to obtain results on consistency of estimators, it is essential to bound 
the "size" of the sieve by entropy numbers which we define here for with respect 
to the || • ||i-distance. 

Definition 4.1 Let J- be a class of probability densities. Let Nb,i{5, T) be the 
smallest value of p € N for which there exist pairs of functions {[f^ffW with 
j = 1, . . . ,p such that \\fj — fjj ||i < 5 for all j, and such that for each f 6 T 
there is a j = j(f) <E {1, . . . ,p} such that 

ff<f< ff- 

Then Hb,i{8, T) — logNB,i(S, J 7 ) is called 5-entropy with bracketing of T . 

We note that this definition relies on the concept of positivity and distance 
between Li-functions. But the same notions exist for the space of trace-class 
operators 7[(Tl), thus by replacing probability densities with density matrices 
and functions with selfadjoint trace class operators we obtain the definition of 
the 5-entropy with bracketing Hb,i(5, 2) for some space of density matrices Q. 



Proposition 4.2 Let Q(n) be the class of density matrices of dimension N(n). 
Then 

%(«, Q(n)) < CN{nf log (4.9) 

a 

for some constant C independent of n and 5. 
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Proof. 



Let {pj, j = 1, . . . ,c(5, n)} be a maximal set of density matrices in Q( 



n 



such that for any j ^ k we have lift — Pfe||i > 2 jv(ti) • ^ e define 

Then for any p in the ball B\{pj, 2 m n ) ) we have p — p k < 2 jVfa) ^' ^ us 

and clearly — P^Wi ~ ^ ^ remains to estimate the number of balls c(<5,n). 
From standard arguments on dimension we obtain 

where the difference on the right side represents the volume between two balls 
of radii 1 — , J, , and 1 + . J, ; . As a rough estimation we obtain 



C (^)<(l + ^)^ 2 <(5^f) 



N(n) 



The bracketing entropy is at most log c(6, n) and we obtain (|4.9() with C = 
1 + log 5. 

□ 

Corollary 4.3 Let "P(rt) 1 / 2 6e </ie cZass of L2-functions {^/Wp '■ Pp G ^ > ( rl )} 
and Hb(S, 'P(n) 1 / 2 ) &e i/ie bracketing entropy with the \\ ■ ^-distance. Then 

H Bil (S,V(n)) < N(n) 2 log^P- (4.10) 

o 

HB^Vin) 1 / 2 ) <iV(n) 2 log^ (4 .H) 

Proo/. 

Let T be the linear extension to Ti(Tt) of the map T. Then T is positivity 
preserving that is, for any r, r' G T~i(Tt) such that r > r' then p T > p T / where 
we extend the notation p T = T(r) to all trace-class operators. Let [pf , pj) be 
the (^-bracketing matrices from the previous proposition. Then by the above 
observation [f (pf ), f (pf)] is a set of brackets for V{n) = T(Q(n)). From the 
monotonicity on the || • ||i proved (|2.5|) we obtain ||X(p^) - T(p^)||i < S. 

For the second inequality we note that [T(p^) 1 / 2 , (T(py) + ) 1 / 2 ] is a set of 
brackets for 7 3 (n) 1/ ' 2 and then it can be shown than 

\\t(pW 2 - (f( P f) + m < 5 -. 
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□ 

We will concentrate now on the Hellinger consistency of the sieve maximum 
likelihood estimator P n . We will appeal to a theorem from ^2] , which is similar 
to other results in the literature on non-parametric M-estimation (see for exam- 
ple QUI)- There are two competing factors which contribute to the connvergece 
of h(P n , P p ). The first is related with the approximation properties of the sieves 
with respect to the whole parameter space. Such a "distance" from p to the sieve 
Q(n) can take different expressions, for example in terms of Kullback-Leibler 
divergence K(q,p) := J plog£, 



S n (0+) := inf K( Pp , lPp ) 
jo'eQ(n) 



and 



r„ = lim 

k — ►QC 



/pp(log^) 2 
J Pk 



(4.12) 



(4.13) 



where {pk, k = 1,2, . . .} C V(n) is a sequence such that lim^oo K(qk,p p ) 
5«(0+). Another natural rate which will be used later is 



7n= inf \\p~p'\\i- 

p'eQ(n) 



(4.14) 



Notice that all this numbers depend on the growth rate of the sieve N(n). 
The second factor influencing the convergence of h(P n ,P P ) is the size of the 
sieves which is expresses by the bracketing entropy. The non-parametric m.l. 
estimation theory shows that the following entropy integral inequality plays an 
important role in determining the rate of convergence 



J B {5,V 1/2 {n)):= H 1 B /2 (-,V(n) 1 / 2 )du<c±Vn'6 2 . (4.15) 

J8 2 /2 8 c 3 

, 4 such that if S n is the small- 



Theorem 4.4 There exist constants Ci,i = 1, 
est value satisfying \4-15\j and we define 

if S n (0+) < \ Cl & 



(4<5„(0+)/ Cl ) 



1/2 



otherwise 



then 



4t„ 



cinet. 



p(h(P n ,P p ) > e n ) < 5e 
In calculating the entropy integral we take into account (|4.11|l 

i-S / iv77„U/2 \ V 2 



(4.16) 



Jb^V 1 ' 2 ^)) = O 



o 



o 



N(n) 
N{nf' 2 
N(n)S ( lo. 



log 



N( n y 



du 



N(n)/8 2 



N(ny/ 2 /S 

N(n)^ 1/2 



w 2 {\ogw) 1 / 2 dw 



(4.17) 
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From the entropy inequality we obtain the rate 5 n satisfying 




(4.18) 



Theorem 4.5 Suppose that the state p satisfies t„ = 0(N(n) T ) for some r > 
0. Let p( n ) be the sieve MLE with N(n) = o(( T5 |^) 1/2 ) and iV(ri)- 1 = o{n~ e ) 
for some 9 > 0. Then p n is Hellinger consistent, i.e. 

h(P n ,P p )^0 a.s.. (4.19) 

Proof. We apply Theorem l4.4l to our particular situation. We can choose a rate 
S n — > satisfying (|4.18() for our particular choice of N(n) and decreasing slower 
that 1/logn. Then 




because the lower bound for N(n) and the class assumption imply that T n 
decreases faster than some power of n. A standard application of the first 
Borel-Cantelli lemma proves almost sure convergence of h(P n , P) — > 0. 

□ 

From the physical point of view, we are more interested in the convergence 
of the state estimator p^™' which is in principle a stronger requirement than 
Hellinger consistency. We will show however that the two are equivalent by 
applying a quantum analogue of the classical Scheffe's lemma |Ej which says 
that if a sequence of probability densities converge pointwise almost everywhere 
to a probability density, then they also converge in || • ||i. We will replace the L\ 
space by the space of trace-class operators T\ (H), and the pointwise convergence 
by weak operator convergence which is roughly (?/>,X n V>) — > (^>,X^>) for all 
ip G H- In particular for density matrices it is sufficient to check the individual 
convergence of all matrix elements in a given basis. For the proof and other 
non-commutative convergence theorems we refer to |13j . 

Theorem 4.6 Let p n be a sequence of density matrices converging weakly to 
another density matrix p. Then \\p n — p\\ \ — > as n — > oo. 

Corollary 4.7 The Hellinger consistency of P n is equivalent to the \\ ■ j|i- 
consistency of (P 1 ' . In particular, under the assumptions of Theorem |^.<5| we 
have — p\\i^ 0, a.s.. 

Proof. By Theorem 14.61 it is enough to prove almost sure convergence of each 
matrix element individually. But we have shown in l|3.6l) that p^j and p^ n - can 
be expressed as the integral of p p and respectively p n with bounded pattern 
functions f k tj (x)e~ i ^- k ^. 

□ 
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Concluding Remarks. There are many open questions related to quantum 
tomography and we would like to enumerate a few of them here. 

The equivalence in last corollary holds as well for efficiency r\ > i as we only 
use the fact that the pattern functions are bounded, but seems to fail for 77 < ^ 
when the pattern functions are unbounded. Is r/ = i some kind of transition 
point between two convergence regimes? 

Another problem which has not been treated here is that of rates of con- 
vergence for estimators. A possible way to obtain this is to find the rates e n of 
convergence for h{P n ,P P ) and then to use the modulus of continuity w„(e) of 
the inverse map on the sieves 

T- 1 : V(n) -> Q(n) (4.20) 

to obtain the rough rate U) n (e n ) for \\p( n > — p\\i. This will lead to a slower 
increase of the sieve dimension N(n). Is there a more direct approach to the 
estimation of the rates? Does the maximum likelihood estimator converge faster 
than the kernel estimator using pattern functions presented in section |3f Can 
we use penalization instead of arbitrarily choosing the dimension of the sieve? 

On the practical side of the problem, finding the maximum of the likeli- 
hood function over a set of density matrix is non-trivial. The positivity and 
normalization constraints must be taken into account. 

In the case n < 1 we have to deconvolve the noise introduced by the detection 
imperfection. The analysis made for perfect detection should be made also in 
this case. It seems to us that the conjecture made by D'Ariano referring to the 
impossibility of reconstructing the state for 77 < i is not true in general, but it 
does pose a kind of restriction. One should identify the class of states for which 
the reconstruction is still possible. 

Needless to say, the methods used here for quantum tomography can be 
applied in other problems of quantum estimation, such as for example estimating 
how certain devices transform the states of quantum systems. 
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